(54) A tool for investigating the properties of a substance



(57) A computer-implemented tool for investigating
the chemical and/or biological properties of a substance
comprising a sequence of nucleic acids or amino acids,
said tool comprising means for displaying in a graphical
display scheme at least one substance having at least
one sequence similar or identical to a sequence of the
substance to be investigated in such a manner that
similar and/or identical sequences are respectively
displayed through a corresponding first graphical indication
the position of which with respect to a first dimension
corresponds to the location of said similar sequence in
said substance, and means for displaying a second
graphical indication corresponding to each first graphical
indication and indicating domain information
corresponding to the sequence indicated by said first
graphical indication.
Description

FIELD OF THE INVENTION 

[0001] The present invention relates to a tool for
investigating the chemical or biological properties of
substances.

DESCRIPTION OF THE RELATED ART 

[0002] Scientists in the field of biology or biochemistry
often encounter new substances which they have
isolated or otherwise identified and about which they want
to know more: their behavior, their capabilities, their
functions, etc.

[0003] For that purpose it is common to consult
databases in which a large volume of information with
respect to substances is stored, for example their structure
but also at least some of their properties and functions. It
is general knowledge that similar structural parts may be
responsible for a similar behavior, similar properties and
similar functions, and this is the basic reasoning behind
the consulting of such databases.
[0004] There exist many databases in which nucleic
acid sequences or amino acid sequences are stored,
often together with an identifier and together with some
additional information further characterizing the stored
sequence with respect to its structure or its function. As
an example, a nucleic acid may contain repetitive
elements such as multiple copies of the three nucleotides
"A, C, and A", e.g. "ACA ACA ACA". The annotation of
the sequence with respect to such a repetitive element
would most likely be stored as additional information
in the database in which the corresponding sequence
is stored.
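
The kind of annotation mentioned above can be illustrated by a short sketch. It is not part of the described tool; the function name find_tandem_repeats and its parameters are assumptions chosen for illustration. It locates runs such as "ACA ACA ACA" in a nucleotide string so that they could be stored as additional information alongside the sequence.

```python
import re

def find_tandem_repeats(seq, unit_len=3, min_copies=3):
    """Return (start, end, unit, copies) for back-to-back repeats of a unit.

    Hypothetical helper: coordinates are 0-based, end is exclusive.
    """
    # The backreference \1 forces the same unit to recur directly after itself.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = (match.end() - match.start()) // unit_len
        hits.append((match.start(), match.end(), unit, copies))
    return hits

# Example: three copies of "ACA" embedded in a longer sequence.
annotations = find_tandem_repeats("GGTACAACAACATTG")
```

Each returned tuple could serve as one annotation record for the stored sequence.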

[0005] Other elements may involve a particular func- 
tion, such as the capability of a protein to metabolize a 
substrate. 

[0006] Many proteins, called enzymes if they catalyze
a reaction, show numerous features or elements about
which much may be known. In proteins, such functional
elements are referred to as "domains". For example,
proteins that are capable of binding a DNA molecule will
contain a "DNA binding domain".

[0007] Throughout the various organisms one may 
find enzymes that perform similar functions. Often such 
a function may be deduced by comparing protein do- 
mains. 

[0008] One possibility could be to directly search for 
known domains in the substance to be investigated. 
This may lead to an initial understanding about the prop- 
erties of the substance to be investigated, since identi- 
fied domains give an indication about a certain function. 
[0009] An alternative method follows a more
elementary approach, namely to search for substances
which have sequences bearing similarities with
sequences of the substance to be investigated. An
unknown substance, such as a protein, a gene, an
enzyme, or the like is used as a query string inputted into
a database; in other words, the sequence of the
unknown substance is used as input. The database is then
searched for substances whose sequences bear some
similarities with the query sequence, in the hope that this
will give the scientists a hint about the properties of the
unknown substance.

[0010] The output of such a query usually consists of
a list of sequences together with a corresponding name
or identifier as well as a numerical figure indicating the
degree of coincidence between query and result. This
is shown in Fig. 1.
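
A toy version of such a query and its list-style output may be sketched as follows. This is not BLAST or FASTA; it merely ranks database entries by the length of their longest exact common substring with the query, using Python's standard difflib module. The function name search, the min_length threshold and the sample database are assumptions for illustration.

```python
from difflib import SequenceMatcher

def search(query, database, min_length=4):
    """Return (identifier, score, query_pos, result_pos) tuples, best first.

    The score is the length of the longest exact common substring, a crude
    stand-in for the similarity figure reported by real search tools.
    """
    hits = []
    for identifier, seq in database.items():
        matcher = SequenceMatcher(None, query, seq, autojunk=False)
        m = matcher.find_longest_match(0, len(query), 0, len(seq))
        if m.size >= min_length:
            hits.append((identifier, m.size, m.a, m.b))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical database of three stored sequences.
database = {"s1": "TTACGTACTT", "s2": "GGGGGG", "s3": "CCACGTCC"}
for identifier, score, q_pos, r_pos in search("ACGTACGT", database):
    print(identifier, score, q_pos, r_pos)
```

The printed rows correspond to the list of identifiers and numerical figures described above; a real tool would report alignment scores rather than substring lengths.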

[0011] In some cases there is also a graphical
indication in the result display as to where the similar
sequences are located in order to thereby give some graphical
indications as to where the two sequences share
character states (domains). This is schematically illustrated
in Fig. 2, where the result sequences 200, 201, 202
comprise indications 211, 212, 213, 214 as to where the
similarities with the query sequence are located.
[0012] Given the previous approaches, there is a 
need for an improved tool for investigating the properties 
of unknown substances. A substantial problem is that 
the chemical and biological properties of a substance 
comprising a sequence of nucleic acids or amino acids 
are not readily apparent when a database output is an- 
alyzed, in particular with respect to their distribution 
within a substance of interest but also within related sub- 
stances of interest. 

SUMMARY OF THE INVENTION 

[0013] The present invention provides a tool for
investigating the properties of a substance based on
similarities of said substance with known substances. This can
be achieved by displaying not only first graphical
indications of the locations of similar sequences but also
second graphical indications of the domains corresponding to
said similar sequences. Herein, domains are to be
understood as regions of distinct structure for which
biological or chemical information is available. In the case of
peptides such domains may be independent regions of
the polypeptide chain, which can on the one hand be
distinguished by the way they are folded (spatial
structure) and by their mobility, and which, on the other
hand, often have separate functions within the protein
and represent genetic modules which are combined into
different genes and thus proteins. In the case of nucleic
acids such domains are elements encoding the above
peptide domains or other distinct elements found in
nucleic acids, such as repetitive elements or the like. Thus,
in its broadest definition a domain is a distinct feature
element for which some information is available, e.g.
about its function or its distribution amongst the various
organisms on earth. Thereby, it becomes possible for the
scientist to grasp at a single glance a hint with respect
to the behavior of an unknown substance. Rather than
giving only statistical information as in the prior art, the


EP 1 199 668 A2 




base. In another embodiment, a unique iden- 
tifier is sent off as a query string. 

Fig. 4 shows a display scheme according to an
embodiment of the present invention.

Fig. 5 shows a display scheme for a further result
according to an embodiment of the present
invention.

Fig. 6 schematically illustrates a computer system to
be used in connection with an embodiment of
the present invention.

Fig. 7 shows an output of the computer-implemented
tool according to the invention. Here, the query
has resulted in the return of sequences 710,
720, 730, 740, 750, 760, and 770. In the case
of "result" sequence 710, the darker bar
designates and comprises regions of similarity
715.
scientist is provided with actual "biological" or
"biochemical" information about the probable behavior of the
substance to be investigated. Generally, if a novel protein,
or a novel gene coding for a protein, is found whose
function is so far unknown, a good approach is
to start by looking at which domains can be identified.
Based on this knowledge, the function of the protein may
become obvious, or at least significant hints for an
initial understanding of the protein's function may be
obtained.

[0014] Preferably, the first graphical indication has the
form of a line, the location of the line indicating the
location of the similar sequence, and the length of the line
indicating the length of the similar sequence. This
makes it easy to get at first glance an impression of
which of the similar sequences may give significant
hints to corresponding properties (the longer ones) and
where they are located.

[0015] Preferably, the first graphical indications for
different sequences are located at different positions in
a second dimension, such as the y-axis. This makes the
display particularly easy to look at, with none of the first
graphical indications overlapping each other.
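
The two preferences above (line position and length in a first dimension, separate positions in a second dimension) can be sketched as a small layout routine. The names Segment and layout, the dictionary keys and the row_height value are assumptions for illustration only and do not describe a specific implementation of the tool.

```python
from collections import namedtuple

# Hypothetical record: a region of a result sequence similar to the query.
# start is inclusive and end is exclusive, both in sequence positions.
Segment = namedtuple("Segment", ["start", "end"])

def layout(segments, row_height=10):
    """Turn similar regions into drawable lines.

    x0/x1 follow the location of the region (first dimension); each line
    gets its own y value (second dimension) so that no two lines overlap.
    """
    lines = []
    ordered = sorted(segments, key=lambda seg: seg.start)
    for row, seg in enumerate(ordered):
        lines.append({
            "x0": seg.start,
            "x1": seg.end,
            "y": row * row_height,
            "length": seg.end - seg.start,
        })
    return lines

lines = layout([Segment(50, 120), Segment(10, 30)])
```

Longer lines then stand for longer similar sequences, which is the visual cue described in the preceding paragraphs.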
[0016] Preferably, the second graphical indication has
the form of a pop-up window which, when the user clicks
on a first graphical indication, pops up and delivers further
information regarding the domain corresponding to said
similar sequence.

[0017] The sequence of the substance to be investi- 
gated is preferably used as an input to query a database 
which then returns substances having similar sequenc- 
es. The results may then be displayed as indicated 
above. 

[0018] The present invention will now be explained in
more detail through exemplary embodiments in
connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

[0019] 

Fig. 1 shows a display scheme according to the prior
art. A query sequence has been sent to a
database as a query string input in order to
search for substances which have
sequence parts bearing similarities with the
sequence to be investigated. The display
according to the prior art is the result returned.

Fig. 2 shows another display scheme according to 
the prior art. In this case the regions are depict- 
ed which show similarity. 

Fig. 3 shows a database query used in connection 
with an embodiment of the invention. In this 
preferred embodiment of the invention, the 
query string is a sequence. It is used to initiate 
a search for similar sequences in the data- 



A mouse-over would result in a pop-up or alternatively
a description to the right showing the high-scoring
segment pair (HSP) regions. Bars 712, 713 and 714 represent
features or domains contained within the sequence. A
mouse-over results in a visual description of the location
and characteristics of the domains. For sequence 750
a mouse-over has been performed at position 751,
resulting in description 752. It is equally possible to double
click on any of the feature bars, e.g. 714, 713, 712 and
751, and thereby arrive at a page providing further
information.



[0020] Fig. 3 schematically shows a situation in which
an embodiment of the present invention may be used.
A substance 300 has to be investigated with respect to
its biological or biochemical properties. It consists of a
nucleic acid sequence comprising nucleotides 300 or
alternatively an amino acid sequence comprising amino
acids.

[0021] In order to investigate the substance, it is used
as an input to query a database 310 in which a large
volume of sequences is stored, possibly with additional
information for each sequence, which may contain not
only its name or identifier but also additional information
with respect to the domains known for the stored
sequences, their corresponding functions, etc.

[0022] The query procedure can be carried out in a
manner which is easy to implement by any person
skilled in the art. It may use any conventionally known
database technology and is therefore not explained in
detail here. It should only be mentioned that the
particular search strategy or search conditions, such as the
degree of similarity between the query sequence and
a dataset stored in database 310 for which "a hit", i.e. a
query result, is returned, depend on the methodology used
for querying; in a preferred embodiment this is a
similarity search tool that searches the database. The query
result returned also depends on the content of the
database. In a preferred embodiment this similarity search
tool is the BLAST algorithm: Altschul, Stephen F.,
Warren Gish, Webb Miller, Eugene W. Myers, and David J.
Lipman (1990). Basic local alignment search tool. J.
Mol. Biol. 215:403-10.

[0023] Another search tool that may be used is FASTA
(W. R. Pearson and D. J. Lipman (1988), "Improved
Tools for Biological Sequence Analysis", PNAS 85:
2444-2448, and W. R. Pearson (1990), "Rapid and
Sensitive Sequence Comparison with FASTP and FASTA",
Methods in Enzymology 183:63-98).
[0024] The invention is not limited to one particular
type of search tool. In one embodiment of the invention
search tools are used that do not search by sequence
similarity but by applying sequence profiles, such as a
profile generated when applying the Profile Hidden
Markov Model.

[0025] Profile Hidden Markov Models, also called
"Hidden Markov Models" and here abbreviated as HMMs, are
statistical models representing the consensus of the
primary structure of a sequence family. The profiles use
position-specific scores for amino acids (or
nucleotides) and position-specific scores for the opening or
the extension of an insertion or deletion. Methods for
the creation of profiles, starting from multiple
alignments, have been introduced by Taylor (1986), Gribskov
et al. (1987), Barton (1990) and Henikoff (1996).
[0026] HMMs provide a fully probabilistic
description of profiles, i.e. Bayesian theory governs the setting
of all probability (scoring) parameters (compare
Krogh et al. 1994, Eddy 1996 and Eddy 1998). The
central idea behind this is that an HMM is a finite model
describing the probability distribution over an infinite number
of possible sequences. The HMM consists of a number
of states corresponding to the columns of a multiple
alignment as it is usually depicted. Each state emits
symbols (residues) according to its state-specific
symbol emission probabilities, and the states are linked
with each other by state transition probabilities.
Starting from one specific state, a succession of states is
generated by moving from one state to the next in
accordance with the state transition probabilities, until
a final state has been reached. Each state then emits
symbols according to the emission probability distribution
specific to this state, creating an observable sequence of
symbols.
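
The generative process just described can be illustrated with a minimal sketch. The model below (states "m1", "i1", "m2", their probabilities, and the function name generate) is a made-up toy example, not a profile taken from any real sequence family and not the HMMER implementation.

```python
import random

# Hypothetical toy model: two match states with an optional insert state.
TRANSITIONS = {
    "begin": {"m1": 1.0},
    "m1": {"m2": 0.9, "i1": 0.1},
    "i1": {"i1": 0.3, "m2": 0.7},
    "m2": {"end": 1.0},
}
EMISSIONS = {
    "m1": {"A": 0.8, "C": 0.2},
    "i1": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "m2": {"G": 0.9, "T": 0.1},
}

def generate(transitions, emissions, start="begin", end="end", rng=None):
    """Walk the states from start to end, emitting one symbol per state."""
    rng = rng or random.Random()
    state, path, symbols = start, [], []
    while True:
        choices = list(transitions[state].items())
        state = rng.choices([s for s, _ in choices],
                            weights=[p for _, p in choices])[0]
        if state == end:
            break
        path.append(state)                       # hidden part
        dist = emissions[state]
        symbols.append(rng.choices(list(dist),   # observable part
                                   weights=list(dist.values()))[0])
    return path, "".join(symbols)

path, sequence = generate(TRANSITIONS, EMISSIONS, rng=random.Random(0))
```

Only the returned sequence would be visible to an observer; the path of states stays hidden.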

[0027] The attribute "hidden" derives from
the fact that the underlying sequence of states cannot
be observed. Only the sequence of symbols is visible.
An estimation of the state transition probabilities
and of the emission probabilities (the training of the
model) is achieved by dynamic programming algorithms
implemented in the HMMER package.



[0028] If an existing HMM and a sequence are given,
the probability that the HMM could generate the
sequence in question can be calculated. The HMMER
package provides a numerical quantity (the score) in
proportion to this probability, i.e. the information content
of the sequence, expressed in bits, measured according
to the HMM.
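
How such a probability and a score in bits might be obtained can be sketched with the standard forward algorithm. This is an illustration under stated assumptions, not the HMMER code: the toy two-state model, the uniform background distribution, and the function names forward_probability and bit_score are all hypothetical, and the score is taken here as the base-2 log-odds of the model probability against the background.

```python
import math

def forward_probability(seq, start, trans, emit, end):
    """Probability that the model emits exactly seq (seq must be non-empty)."""
    states = list(emit)
    # f[s]: probability of emitting the prefix read so far and being in s.
    f = {s: start.get(s, 0.0) * emit[s].get(seq[0], 0.0) for s in states}
    for symbol in seq[1:]:
        f = {
            s: emit[s].get(symbol, 0.0)
               * sum(f[r] * trans[r].get(s, 0.0) for r in states)
            for s in states
        }
    return sum(f[s] * end.get(s, 0.0) for s in states)

def bit_score(seq, p_model, background):
    """Log-odds in bits of the model probability versus a background model."""
    p_null = math.prod(background[c] for c in seq)
    return math.log2(p_model / p_null)

# Hypothetical toy model: state m1 is always followed by m2, then the end.
START = {"m1": 1.0}
TRANS = {"m1": {"m2": 1.0}, "m2": {}}
END = {"m2": 1.0}
EMIT = {"m1": {"A": 0.8, "C": 0.2}, "m2": {"G": 0.9, "T": 0.1}}
BACKGROUND = {c: 0.25 for c in "ACGT"}

p = forward_probability("AG", START, TRANS, EMIT, END)
score = bit_score("AG", p, BACKGROUND)
```

A positive score means the sequence is better explained by the model than by the background; a sequence the model cannot generate has probability zero and no finite score.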

[0029] See also Barton, G.J. (1990): Protein multiple
alignment and flexible pattern matching. Methods
Enzymol. 183: 403-427; Eddy, S.R. (1996): Hidden Markov
models. Curr. Opin. Struct. Biol. 6: 361-365; Eddy, S.R.
(1998): Profile hidden Markov models. Bioinformatics
14: 755-763; Gribskov, M., McLachlan, A.D. and
Eisenberg, D. (1987): Profile analysis: Detection of distantly
related proteins. Proc. Natl. Acad. Sci. USA 84:
4355-4358; Henikoff, S. (1996): Scores for sequence
searches and alignment. Curr. Opin. Struct. Biol. 6:
353-360; Krogh, A., Brown, M., Mian, I.S., Sjolander, K.
and Haussler, D. (1994): Hidden Markov models in
computational biology: Applications to protein modelling. J.
Mol. Biol. 235: 1501-1531; Taylor, W.R. (1986):
Identification of protein sequence homology by consensus
template alignment. J. Mol. Biol. 188: 233-258.
[0030] In general the search conditions are selected such
that a query using a search sequence returns a result
consisting of sequences which are at least to a certain
degree similar to the query sequence.
[0031] After one or more result datasets have been
returned by database 310, they are displayed as
schematically illustrated in Fig. 4.

[0032] The line 400 shown in Fig. 4 indicates a first 
hit returned from a database query based on a query 
string. Two other query results 420 and 430 are also 
shown in Fig. 4. Result 400 consists of a sequence 
which has been found as bearing at least some similarity 
to the sequence used as query sequence. The query 
string itself may be displayed in a separate window not 
shown in Fig. 4. 

[0033] For result 400 there are four (partial)
sequences in which result 400 is similar to the query string. They
are graphically indicated by the lines 401, 402, 403, and
404. In order to avoid an overlap between lines 401 to
404 they are respectively located at different positions
in the x-direction. The length of those lines indicates the
extent of coincidence with the query string: in the case of
sequence 404 there is e.g. a large degree of
coincidence between query and result, whereas in the case
of sequence 403 the degree of coincidence is relatively
small. Degree of coincidence here means the length of
a sequence part in which query and result coincide with
each other.

[0034] There is still further information which a
scientist can derive from the display scheme shown in Fig. 4,
namely where in the query the similar sequences are
located. This may already give some indication about
the function of some of the sequences. The information
can be derived from the respective centers of lines 401
to 404, as indicated by the dashed lines in Fig. 4. The
query sequence itself may also be displayed, such as by
the dashed line 415 in Fig. 4, although this is not
actually necessary. The results 400, 420 and 430 are
preferably normalized in their length in the y-direction to the
length of the query sequence.
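
One possible reading of this normalization is a linear rescaling of positions on a result sequence so that every result is drawn with the same length as the query. The sketch below assumes exactly that; the function names and the example lengths are hypothetical.

```python
def normalize(position, result_length, query_length):
    """Scale a position on a result sequence to the query's length."""
    return position * query_length / result_length

def normalize_segment(start, end, result_length, query_length):
    """Rescale both ends of a similar region (assumed linear mapping)."""
    return (normalize(start, result_length, query_length),
            normalize(end, result_length, query_length))

# A region spanning positions 100 to 300 of a result of length 1000,
# drawn against a query of length 500.
scaled = normalize_segment(100, 300, 1000, 500)
```

After rescaling, all results share one coordinate range, so regions in different results can be compared by eye.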

[0035] Additional information with respect to the
domains (domain information) related to the similar
sequences is displayed for example in a pop-up window
410 which shows up when e.g. mouse pointer 405
moves close to line 404 (when the mouse comes over
line 404) or when the user clicks on line 404. Then the
user is provided with additional information regarding
the domain(s) related to the sequence identified as
similar. Such information may contain the function name of
that domain; it may also contain additional information
about its structure, or any information stored in a
database about that domain.
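
The mouse-over behavior can be sketched as a simple hit test that maps a pointer position to the domain information of the line under it. The data layout (dictionaries with x0, x1, y and domain keys), the tolerance value and the function name domain_info_at are assumptions made for this illustration; a real implementation would rely on the event handling of its GUI toolkit.

```python
def domain_info_at(x, y, lines, tolerance=3):
    """Return the domain information of the line under the pointer, if any."""
    for line in lines:
        on_line = line["x0"] <= x <= line["x1"]
        near_line = abs(y - line["y"]) <= tolerance
        if on_line and near_line:
            return line["domain"]
    return None  # pointer is not close to any line: no pop-up

# Hypothetical display content: two lines with attached domain records.
lines = [
    {"x0": 10, "x1": 30, "y": 0,
     "domain": {"name": "DNA binding domain"}},
    {"x0": 50, "x1": 120, "y": 10,
     "domain": {"name": "repetitive element"}},
]
info = domain_info_at(60, 11, lines)
```

The returned record is what a pop-up window would render when the pointer comes over a line.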

[0036] From that information a scientist can derive
extremely valuable hints as to the functions or
properties of the unknown substance. If e.g. the results
show that the query and the results share the "DNA
binding domain" as the domain information, then this gives
an indication that the unknown substance is a protein
that binds DNA rather than, e.g., digesting (cutting) it.
The pop-up window need not provide only domain
information. It may happen that a sequence part found to
be similar does not belong to a certain domain but rather
has other features or character states, such as that it
contains a repetitive element or the like. This kind of
information could show up as well in the pop-up window.
The pop-up window may contain any information found
in the database about the element or domain which has
been found to be similar to the query sequence. It should
be noted that a pop-up window is only the preferred
graphical representation of the additional information
provided. Other possibilities are, e.g., depicting the
information directly in the same window.
[0037] In Fig. 4 three hits (results) 400, 420, 430
together with additional graphical information regarding
the location of similar sequences and the corresponding
domains are schematically illustrated. In a preferred
embodiment, particular sequence features, e.g. domains,
may be attributed defined graphical representations.
[0038] Fig. 5 illustrates another example of a result
provided by the computer-implemented tool according
to the invention. It becomes readily apparent that the
result set 520, 530, 540, and 550 obtained for the query
sequence 510 contains two groups of results. In the first
group, results 520 to 550 all contain a sequence 545
corresponding to a DNA binding domain, as becomes
apparent from a pop-up window (not shown), which
shows up when the cursor moves to one of the
sequences 545. The other group, 540 and 550, however, contains
a domain 555 corresponding to a DNA polymerization
domain, i.e. activity for synthesizing DNA.
[0039] Moving the mouse over domain 560 will result
in the information that this is a domain found in enzymes
that digest (degrade) DNA.
[0040] By applying the invention, the user has thus
learned that there are two groups of substances that are
similar to the query substance, having in common the
capability to bind DNA: a first group capable of digesting
DNA, to which the query sequence belongs, and a
second group capable of synthesizing DNA.
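
The grouping that the user reads off the display can also be sketched programmatically. In the example data below, the assignment of the DNA digestion domain to results 520 and 530 is an assumption made for illustration (the description only states that the query belongs to the digesting group); the function names are likewise hypothetical.

```python
from collections import defaultdict

def shared_domains(results):
    """Domains present in every result."""
    sets = [set(domains) for domains in results.values()]
    return set.intersection(*sets) if sets else set()

def group_by_domains(results):
    """Group result identifiers by their exact set of domains."""
    groups = defaultdict(list)
    for identifier, domains in results.items():
        groups[frozenset(domains)].append(identifier)
    return dict(groups)

# Hypothetical domain annotations for the four results of Fig. 5.
results = {
    "520": ["DNA binding", "DNA digestion"],
    "530": ["DNA binding", "DNA digestion"],
    "540": ["DNA binding", "DNA polymerization"],
    "550": ["DNA binding", "DNA polymerization"],
}
common = shared_domains(results)
groups = group_by_domains(results)
```

Here common holds the capability shared by all results, while groups separates the digesting results from the synthesizing ones.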
[0041] In another embodiment (not shown here), by
using a user option a user may change the display of
Fig. 5 such that, instead of only similar sequences or
domains, all the domains known for results 510 to 550
are displayed. This would then result, for each of the
sequences 510 to 550, in a display as if it had been
itself the query sequence. With this option a user may
then e.g. find that sequences 540 and 550 contain
domains 570 which are responsible for a different function
such as an "intern" (an interspersed element) or any
other known element. The similarity between those results
and the query sequence, however, clearly is not based
on a similarity regarding this function but on other
similarities.

[0042] Another user option may be that a user can
choose to have displayed all the sequences in the query
which belong to already identified domains, irrespective
of the result sequence in which they occur. Basically this
would consist in replacing lines 510 to 550 by a line
representing the query and further in displaying just once
all domains displayed more than once in Fig. 5. This would
give a concise display of all domains related to the
query, although this scheme of displaying may become
quite difficult to survey if many domains (or
corresponding sequences) are to be displayed.
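
This option amounts to collapsing the domains found across all results onto the query and removing duplicates. A minimal sketch follows, assuming that each domain occurrence is already expressed in query coordinates as a (name, start, end) tuple; that representation and the function name domains_on_query are illustrative assumptions.

```python
def domains_on_query(results):
    """Collect each (name, start, end) domain once, ordered along the query."""
    seen = set()
    merged = []
    for domains in results.values():
        for domain in domains:
            if domain not in seen:      # shown more than once: keep one copy
                seen.add(domain)
                merged.append(domain)
    return sorted(merged, key=lambda d: d[1])

# Hypothetical occurrences, already mapped to query positions.
results = {
    "520": [("DNA binding", 40, 90), ("DNA digestion", 120, 200)],
    "530": [("DNA binding", 40, 90)],
    "540": [("DNA binding", 40, 90)],
}
query_domains = domains_on_query(results)
```

The resulting list corresponds to a single line for the query with each related domain drawn once.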
[0043] Many variations to the above-mentioned
embodiment can be imagined. For example, not only the
domains for which a similarity of sequences between
query and result exists could be displayed, but in general
all domains known for a found result. This could e.g. be
provided as a further option only to be displayed when
requested by the user.

[0044] It can e.g. also be imagined that different colors
are used for lines 401 to 404 indicating similar
sequences, the colors e.g. coding for the most frequent domains
or functions, such as DNA binding or not DNA binding,
or the like. Then even more information could be directly
derived from a display as shown in Fig. 4, even without
a pop-up window. Which color codes for what may
differ from implementation to implementation.
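
A color coding of this kind could be sketched as follows. The palette, the default color and the function name assign_colors are assumptions; as stated above, the actual mapping may differ from implementation to implementation.

```python
from collections import Counter

def assign_colors(domain_names, palette=("red", "blue"), default="grey"):
    """Give the most frequent domains their own color, the rest a default."""
    ranked = [name for name, _ in
              Counter(domain_names).most_common(len(palette))]
    colors = dict(zip(ranked, palette))
    return lambda name: colors.get(name, default)

# Hypothetical domain names attached to the lines of a display.
names = ["DNA binding"] * 3 + ["DNA polymerization"] * 2 + ["repeat"]
color_of = assign_colors(names)
```

Each line would then be drawn in color_of(name), so frequent functions can be recognized without opening a pop-up window.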
[0045] There may also be provided a further window
in which e.g. the name of the query and the sequences
found may be listed, possibly such that the sequence
where the mouse is located in Fig. 4 is the one also
shown as the first one in the additional window.
[0046] Embodiments of the present invention may be
implemented using a computer system as schematically
illustrated in Fig. 6. A computer system 600 may
comprise a computer 605 connected to a display 610, a
mouse 620, and comprising some storage medium 630
such as a floppy disk drive, a CD-ROM drive or the like,
and some hardware components 640 comprising at
gated result from a query of a database using said
substance to be investigated as the query input for 
finding through said query said one or more sub- 
stances having similar or identical sequences. 

6. The computer-implemented tool according to
claim 5, wherein
said query is performed using a method chosen
from the group comprising BLAST, FASTA and/or
the Profile Hidden Markov Model.

7. The computer-implemented tool according to one
of the preceding claims, further comprising:

means for displaying in response to a user
request for a result sequence
all domains contained in the result set, such as
if the result would itself have been the query
sequence, and/or
means for displaying in response to a user
request all domains identifiable in said query
sequence.

8. A method for investigating the chemical and/or
biological properties of a substance comprising a
sequence of nucleic acids or amino acids, said
method comprising:

means for displaying in a graphical display
scheme at least one substance having at least
one sequence similar or identical to a sequence
of the substance to be investigated in such a
manner that similar and/or identical sequences
are respectively displayed through a
corresponding first graphical indication the position
of which with respect to a first dimension
corresponds to the location of said similar
sequence in said substance; and
means for displaying a second graphical
indication corresponding to each first graphical
indication and indicating a domain information
corresponding to the sequence indicated by
said first graphical indication.



least one CPU and a memory such as to enable the 
computer to carry out by means of the CPU program 
instructions stored in said memory. The program itself 
may be stored on any computer readable medium, or it 
may be carried out on a remote computer (host) to be
accessed by a client computer through a communica- 
tions link such as the internet. Hybrid implementations 
are also possible, such as a Java implementation being 
downloaded in part or as a whole through a network and 
carried out on a client. 



Claims 

1. A computer-implemented tool for investigating
the chemical and/or biological properties of a
substance comprising a sequence of nucleic acids or
amino acids, said tool comprising:

means for displaying in a graphical display 
scheme at least one substance having at least 
one sequence similar or identical to a sequence 
of the substance to be investigated in such a 
manner that similar and/or identical sequences 
are respectively displayed through a corre- 
sponding first graphical indication the position 
of which with respect to a first dimension cor- 
responds to the location of said similar se- 
quence in said substance; and 
means for displaying a second graphical indi- 
cation corresponding to each first graphical in- 
dication and indicating a domain information 
corresponding to the sequence indicated by 
said first graphical indication. 

2. The computer-implemented tool according to
claim 1, wherein

said first graphical indication has the form of a line, 
the length of said line indicating the length of said 
similar sequence, and the position of said line indi- 
cating the location of said similar sequence. 

3. The computer-implemented tool according to
claim 1 or 2, wherein
said first graphical indications for different
sequences are located at different positions in a second
dimension.

4. The computer-implemented tool according to one
of the preceding claims, wherein
said second graphical indication is a pop-up window
showing up in response to an action performed by
a user.

5. The computer-implemented tool according to one
of the preceding claims, wherein
said one or more substances having sequences
similar or identical to the substance to be investi-



9. The method according to claim 8, wherein 
said first graphical indication has the form of a line, 
the length of said line indicating the length of said 
similar sequence, and the position of said line indi- 
cating the location of said similar sequence. 

10. The method according to claim 8 or 9, wherein 
said first graphical indications for different sequenc- 
es are located at different positions in a second di- 
mension. 

11. The method according to one of claims 8 to 10,
wherein
said second graphical indication is a pop-up window
showing up in response to an action performed by
a user.

12. The method according to one of claims 8 to 11,
further comprising
querying a database with said substance to be
investigated as an input to obtain as a result one or
more substances having sequences similar or
identical to sequences of said substance to be
investigated.

12. The method according to claim 12, wherein
said query is performed using a method chosen
from the group comprising BLAST, FASTA and/or
the Profile Hidden Markov Model.

13. The method according to one of claims 7 to
further comprising

displaying in response to a user request for a
result sequence
all domains contained in the result set, such as
if the result would itself have been the query
sequence, and/or
displaying in response to a user request all
domains identifiable in said query sequence.

14. A computer program comprising computer
program code for enabling a computer to carry out a
method according to one of claims 6 to 12.
Fig. 1 (prior-art display: list of "Sequences producing significant alignments", each with a database identifier, a short description, a Score in bits and an E Value)
Fig. 4

Fig. 5